Possible Approach

Hypothesis: does adding the Pos, Neg, and Neu values from sentiment analysis improve the original model?


In [24]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.head()


Out[24]:
label review
0 neg how do films like mouse hunt get into theatres...
1 neg some talented actresses are blessed with a dem...
2 pos this has been an extraordinary year for austra...
3 pos according to hollywood movies made in last few...
4 neg my first press screening of 1998 and already i...

In [25]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():  # iterate over (index, label, review) rows
    if isinstance(rv, str):        # skip any remaining non-string values
        if rv.isspace():           # test 'review' for whitespace only
            blanks.append(i)       # add matching index numbers to the list

df.drop(blanks, inplace=True)
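
The same cleanup can probably be done without an explicit loop. The following is a minimal sketch on a small made-up frame (the `toy` rows are hypothetical, not taken from moviereviews.tsv). Unlike the loop above, it also drops reviews that are completely empty strings, since `''.isspace()` is False.

```python
import pandas as pd

# Hypothetical rows standing in for the real reviews
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos'],
    'review': ['bad film', '   ', None, 'great film'],
})

# Drop rows with missing values first
cleaned = toy.dropna()

# Then keep only rows whose review still has text after stripping whitespace
cleaned = cleaned[cleaned['review'].str.strip() != '']

print(cleaned)
```

Only the rows at index 0 and 3 survive here, because the other two reviews are whitespace-only or missing.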

In [26]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

In [28]:
# Score every review with VADER; each result is a dict with 'neg', 'neu', 'pos' and 'compound' keys
df['scores'] = df['review'].apply(sid.polarity_scores)

In [29]:
df.head()


Out[29]:
label review scores
0 neg how do films like mouse hunt get into theatres... {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...
1 neg some talented actresses are blessed with a dem... {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...
2 pos this has been an extraordinary year for austra... {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com...
3 pos according to hollywood movies made in last few... {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co...
4 neg my first press screening of 1998 and already i... {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com...

In [30]:
# Pull each value out of the scores dict into its own column
df['positive'] = df['scores'].apply(lambda score_dict: score_dict['pos'])
df['negative'] = df['scores'].apply(lambda score_dict: score_dict['neg'])
df['neutral'] = df['scores'].apply(lambda score_dict: score_dict['neu'])
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
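
As an alternative to four separate `apply` calls, the dictionaries can be expanded in a single pass. This is only a sketch: the two score dictionaries are copied from rows 0 and 2 of the output shown in this notebook, and the `toy` and `expanded` names are introduced here for illustration.

```python
import pandas as pd

# Two score dicts as returned by sid.polarity_scores (values from rows 0 and 2)
toy = pd.DataFrame({
    'scores': [
        {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'compound': -0.9125},
        {'neg': 0.067, 'neu': 0.783, 'pos': 0.150, 'compound': 0.9953},
    ]
})

# Build one column per dict key, keeping the original row index
expanded = pd.DataFrame(toy['scores'].tolist(), index=toy.index)

# Rename to the column names used in this notebook
expanded = expanded.rename(
    columns={'pos': 'positive', 'neg': 'negative', 'neu': 'neutral'}
)

toy = toy.join(expanded)
print(toy.drop(columns='scores'))
```

Passing `index=toy.index` matters once rows have been dropped, because the index is then no longer a clean 0..n-1 range.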

In [31]:
df.head()


Out[31]:
label review scores positive negative neutral compound
0 neg how do films like mouse hunt get into theatres... {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... 0.101 0.121 0.778 -0.9125
1 neg some talented actresses are blessed with a dem... {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... 0.105 0.120 0.775 -0.8618
2 pos this has been an extraordinary year for austra... {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com... 0.150 0.067 0.783 0.9953
3 pos according to hollywood movies made in last few... {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co... 0.145 0.069 0.786 0.9972
4 neg my first press screening of 1998 and already i... {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com... 0.088 0.090 0.822 -0.7264
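
The evaluation cells further down compare `label` against a `comp_score` column that none of the cells shown here create. One plausible way to derive it, sketched under the assumption that a compound score of 0 or above counts as 'pos' (the threshold is a guess, not something the notebook states), is:

```python
import pandas as pd

# Compound values from rows 0, 2 and 4 above, plus a hypothetical 0.0 edge case
toy = pd.DataFrame({'compound': [-0.9125, 0.9953, 0.0, -0.7264]})

# Assumption: compound >= 0 is treated as positive, anything below as negative
toy['comp_score'] = toy['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

print(toy)
```

With this rule an exactly neutral review (compound 0.0) is labelled 'pos', which would bias predictions slightly toward the positive class.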

In [33]:
print(df.iloc[15]['review'])


here's a rarity : a children's film that attempts to tackle a weighty subject , is there a god ? 
done well , it could have been a gem among the wasteland of modern children's cinema . 
unfortunately , it isn't . 
with jumbled messages , and an unclear audience , wide awake was better left asleep . 
fifth grader joshua beal ( joseph cross ) is in the middle of a moral crisis . 
his beloved grandfather ( robert loggia ) has died , and joshua has begun a quest . 
he wants to find god , to discover why bad things happen . 
this religious quest is slightly disturbing for his parents ( dana delany and denis leary ) , but they do their best to cope with their son as he explores different religious faiths . 
at his catholic school , his favorite teacher , sister terry ( rosie o'donnell ) , tries to give him guidance , but this is a journey he must make on his own . 
meanwhile , he is having the most momentous year of his life . 
he has several adventures with his daredevil best friend dave ( timothy reifsnyder ) , he gets his first crush , and begins to wake up to the world around him while he is on his spiritual journey . 
it is somewhat confusing as to what the real audience for wide awake is expected to be . 
on its surface , it appears to be a kid's film . 
however , it deals with serious issues , and is likely to be boring for today's instant-gratification kids . 
and while it might seem heartening to see that someone is trying to produce something thoughtful for the kidvid audience , wide awake asks serious questions , but only delivers a cheap gimmick for an answer . 
if there were a bit more meat in the story , adults on a nostalgic bent might get a kick out of the movie . 
the actors who might have created a great cast ( o'donnell , leary and delany ) are wasted in roles that amount to little more than cameos . 
the nostalgic elements ( best friend , favorite teacher , first crush , etc . ) have been done much better in other movies , and actually seem more like filler here . 
the film's strongest scenes are some touching flashbacks depicting joshua's relationship with his grandfather . 
they show more depth than is present anywhere else in the movie . 
maybe the film would have been better if , instead of playing the relationship through flashbacks , it were set entirely during joshua's last year with his grandpa . 
it certainly would have been more entertaining . 
wide awake can best be described as a failed experiment . 
it starts out with noble aspirations , but never delivers on its promise . 
parents who do take their children to see this one ought to be prepared to answer some tough questions . . . that is if their kids aren't bored to death first . 


In [16]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [17]:
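# Compare the true labels with the predicted labels in `comp_score`
# (a 'pos'/'neg' column, presumably derived from `compound`; the cell that creates it is not included in this section)
accuracy_score(df['label'], df['comp_score'])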


Out[17]:
0.6367389060887513
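
`confusion_matrix` is imported above but never called. A minimal sketch of how it could be used follows, with made-up labels rather than the real predictions (`y_true` and `y_pred` are hypothetical):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = ['neg', 'neg', 'neg', 'pos', 'pos', 'pos']
y_pred = ['neg', 'pos', 'pos', 'pos', 'pos', 'neg']

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=['neg', 'pos'])

# Wrap in a DataFrame so the axes are labelled
cm_df = pd.DataFrame(cm, index=['true neg', 'true pos'], columns=['pred neg', 'pred pos'])
print(cm_df)
```

On the real data, the same call with `df['label']` and `df['comp_score']` would show how many negative reviews were scored as positive, which the accuracy figure alone does not reveal.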

In [18]:
print(classification_report(df['label'],df['comp_score']))


              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

   micro avg       0.64      0.64      0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938
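
To test the hypothesis at the top of this section, the VADER columns could be fed to a classifier alongside the text features. The sketch below assumes the "original model" is a TF-IDF text classifier, which this section does not confirm. The rows, the `build_model` helper, and the choice of `LogisticRegression` are all illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical reviews and VADER-style scores, not taken from the real data
toy = pd.DataFrame({
    'label':    ['neg', 'pos', 'neg', 'pos', 'neg', 'pos'],
    'review':   ['a dull and boring film', 'a wonderful touching story',
                 'terrible acting and a weak plot', 'great cast and a clever script',
                 'a failed experiment', 'an extraordinary year for film'],
    'positive': [0.05, 0.40, 0.02, 0.45, 0.00, 0.30],
    'negative': [0.45, 0.00, 0.50, 0.00, 0.55, 0.00],
    'neutral':  [0.50, 0.60, 0.48, 0.55, 0.45, 0.70],
})


def build_model(use_vader):
    """Return a pipeline using TF-IDF alone or TF-IDF plus the VADER columns."""
    transformers = [('tfidf', TfidfVectorizer(), 'review')]
    if use_vader:
        # Pass the numeric sentiment columns through unchanged
        transformers.append(('vader', 'passthrough', ['positive', 'negative', 'neutral']))
    return Pipeline([
        ('features', ColumnTransformer(transformers)),
        ('clf', LogisticRegression(max_iter=1000)),
    ])


X = toy[['review', 'positive', 'negative', 'neutral']]
y = toy['label']

# Fit both variants on the same rows
baseline = build_model(use_vader=False).fit(X, y)
augmented = build_model(use_vader=True).fit(X, y)

baseline_pred = baseline.predict(X)
augmented_pred = augmented.predict(X)
print(baseline_pred, augmented_pred)
```

On the real data the two variants would need to be compared on a held-out test split, for example with `classification_report`, since predictions on the training rows say nothing about whether the extra columns help.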